A Surprisingly Robust Trick for the Winograd Schema Challenge 
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Abstract 


The Winograd Schema Challenge (WSC) da- 
taset WSC273 and its inference counterpart 
WNLI are popular benchmarks for natural lan- 
guage understanding and commonsense rea- 
soning. In this paper, we show that the perfor- 
mance of three language models on WSC273 
consistently and robustly improves when fine- 
tuned on a similar pronoun disambiguation 
problem dataset (denoted WSCR). We addi- 
tionally generate a large unsupervised WSC- 
like dataset. By fine-tuning the BERT lan- 
guage model both on the introduced and on 
the WSCR dataset, we achieve overall accu- 
racies of 72.5% and 74.7% on WSC273 and 
WNLI, improving the previous state-of-the- 
art solutions by 8.8% and 9.6%, respectively. 
Furthermore, our fine-tuned models are also 
consistently more accurate on the “complex” 
subsets of WSC273, introduced by Trichelair 
et al. (2018). 


1 Introduction 


The Winograd Schema Challenge (WSC) (Leves- 
que et al., 2012, 2011) was introduced for testing 
AI agents for commonsense knowledge. Here, we 
refer to the most popular collection of such sen- 
tences as WSC273, to avoid confusion with other 
slightly modified datasets, such as PDP60, (Davis 
et al., 2017) and the Definite Pronoun Resolution 
dataset (Rahman and Ng, 2012), denoted WSCR 
in the sequel. WSC273 consists of 273 instan- 
ces of the pronoun disambiguation problem (PDP) 
(Morgenstern et al., 2016). Each is a sentence (or 
two) with a pronoun referring to one of the two or 
more nouns; the goal is to predict the correct one. 
The task is challenging, since WSC examples are 
constructed to require human-like commonsense 
knowledge and reasoning. The best known solu- 
tions use deep learning with an accuracy of 63.7% 
(Opitz and Frank, 2018; Trinh and Le, 2018). The 
problem is difficult to solve not only because of the 
commonsense reasoning challenge, but also due 
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to the small existing datasets making it difficult to 
train neural networks directly on the task. 

Neural networks have proven highly effective 
in natural language processing (NLP) tasks, out- 
performing other machine learning methods and 
even matching human performance (Hassan et al., 
2018; Nangia and Bowman, 2018). However, su- 
pervised models require many per-task annotated 
training examples for a good performance. For 
tasks with scarce data, transfer learning is often 
applied (Howard and Ruder, 2018; Johnson and 
Zhang, 2017), i.e., a model that is already trained 
on one NLP task is used as a starting point for 
other NLP tasks. 

A common approach to transfer learning in 
NLP is to train a language model (LM) on large 
amounts of unsupervised text (Howard and Ruder, 
2018) and use it, with or without further fine-tu- 
ning, to solve other downstream tasks. Build- 
ing on top of a LM has proven to be very suc- 
cessful, producing state-of-the-art (SOTA) results 
(Liu et al., 2019; Trinh and Le, 2018) on bench- 
mark datasets like GLUE (Wang et al., 2019) or 
WSC273 (Levesque et al., 2011). 

In this work, we first show that fine-tuning 
existing LMs on WSCR is a robust method of 
improving the capabilities of the LM to tackle 
Wsc273 and WNLI. This is surprising, because 
previous attempts to generalize from the WSCR 
dataset to WSC273 did not achieve a major im- 
provement (Opitz and Frank, 2018). Secondly, 
we introduce a method for generating large-scale 
WSc-like examples. We use this method to create 
a 2.4M dataset from English Wikipedia!, which 
we further use together with WSCR for fine- 
tuning the pre-trained BERT LM (Devlin et al., 
2018). The dataset will be made publicly avail- 
able. We achieve accuracies of 72.5% and 74.7% 
on WSC273 and WNLI, improving the previous 


'https://dumps.wikimedia.org/enwiki/ 
dump id: enwiki-20181201 


4837 


Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4837-4842 
Florence, Italy, July 28 - August 2, 2019. ©2019 Association for Computational Linguistics 


best solutions by 8.8% and 9.6%, respectively. 


2 Background 


This section introduces the main LM used in our 
work, BERT (Devlin et al., 2018), followed by a 
detailed description of WSC and its relaxed form, 
the Definite Pronoun Resolution problem. 


BERT. Our work uses the pre-trained Bidirec- 
tional Encoder Representations from Transform- 
ers (BERT) LM (Devlin et al., 2018) based on 
the transformer architecture (Vaswani et al., 2017). 
Due to its high performance on natural language 
understanding (NLU) benchmarks and the sim- 
plicity to adapt its objective function to our fine- 
tuning needs, we use BERT throughout this work. 

BERT is originally trained on two tasks: masked 
token prediction, where the goal is to predict the 
missing tokens from the input sequence, and next 
sentence prediction, where the model is given two 
sequences and asked to predict whether the second 
sequence follows after the first one. 

We focus on the first task to fine-tune BERT us- 
ing Wsc-like examples. We use masked token 
prediction on a set of sentences that follow the 
WSC structure, where we aim to determine which 
of the candidates is the correct replacement for the 
masked pronoun. 


Winograd Schema Challenge. Having introdu- 
ced the goal of the Winograd Schema Challenge 
in Section 1, we illustrate it with the following ex- 
ample: 

The trophy didn’t fit into the suitcase because it 
was too [large/small]. 

Question: What was too [large/small]? 

Answer: the trophy / the suitcase 


The pronoun “it” refers to a different noun, 
based on the word in the brackets. To correct- 
ly answer both versions, one must understand the 
meaning of the sentence and its relation to the 
changed word. More specifically, a text must meet 
the following criteria to be considered for a Wino- 
grad Schema (Levesque et al., 2011): 

1. Two parties must appear in the text. 

2. A pronoun appears in the sentence and refers 
to one party. It would be grammatically cor- 
rect if the pronoun referred to the other. 

3. The question asks to determine what party the 
pronoun refers to. 

4. A “special word” appears in the sentence. 
When switched to an “alternative word”, the 


sentence remains grammatically correct, but 
the referent of the pronoun changes. 


Additionally, commonsense reasoning must be 
required to answer the question. 

A detailed analysis by Trichelair et al. (2018) 
shows that not all WSC273 examples are equally 
difficult. They introduce two complexity mea- 
sures (associativity and switchability) and, based 
on them, refine evaluation metrics for WSC273. 

In associative examples, one of the parties is 
more commonly associated with the rest of the 
question than the other one. Such examples are 
seen as “easier” than the rest and represent 13.5% 
of WSC273. The remaining 86.5% of WSC273 is 
called non-associative. 

47% of the examples are “switchable”, because 
the roles of the parties can be changed, and ex- 
amples still make sense. A model is tested on the 
original, “unswitched” switchable subset and on 
the same subset with switched parties. The con- 
sistency between the two results is computed by 
comparing how often the model correctly changes 
the answer when the parties are switched. 


Definite Pronoun Resolution. Since collecting 
examples that meet the criteria for WSC is hard, 
Rahman and Ng (2012) relax the criteria and 
construct the Definite Pronoun Resolution (DPR) 
dataset, following the structure of WSC, but also 
accepting easier examples. The dataset, referred 
throughout the paper as WSCR, is split into a train- 
ing set with 1322 examples and test set with 564 
examples. Six examples in the WSCR training set 
reappear in WSC273. We remove these examples 
from WSCR. We use the WSCR training and test 
sets for fine-tuning the LMs and for validation, re- 
spectively. 


WNLI. One of the 9 GLUE benchmark tasks 
(Wang et al., 2019), WNLI is very similar to the 
WSC273 dataset, but is phrased as an entailment 
problem instead. A WSC schema is given as a 
premise. The hypothesis is constructed by extract- 
ing the sentence part where the pronoun is, and 
replacing the pronoun with one candidate. The la- 
bel is 1, if the candidate is the correct replacement, 
and 0, otherwise. 


3 Related Work 


There have been several attempts at solving 
WSC273. Previous work is based on Google 
queries for knowledge (Emami et al., 2018) (58%), 
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sequence ranking (Opitz and Frank, 2018) (63%), 
and using an ensemble of LMs (Trinh and Le, 
2018) (63%). 

A critical analysis (Trichelair et al., 2018) 
showed that the main reason for success when us- 
ing an ensemble of LMs (Trinh and Le, 2018) was 
largely due to imperfections in WSC273, as dis- 
cussed in Section 2. 

The only dataset similar to WSC273 is an eas- 
ier but larger (1886 examples) variation published 
by Rahman and Ng (2012) and earlier introduced 
as WSCR. The sequence ranking approach uses 
WSCR for training and attempts to generalize to 
WSC273. The gap in performance scores between 
WSCR and WSC273 (76% vs. 63%) implies that 
examples in WSC273 are much harder. We note 
that Opitz and Frank (2018) do not report remov- 
ing the overlapping examples between WSCR and 
WSC273. 

Another important NLU benchmark is GLUE 
(Wang et al., 2019), which gathers 9 tasks and is 
commonly used to evaluate LMs. The best score 
has seen a huge jump from 0.69 to over 0.82 in a 
single year. However, WNLI is a notoriously diffi- 
cult task in GLUE and remains unsolved by the ex- 
isting approaches. None of the models have beaten 
the majority baseline at 65.1, while human perfor- 
mance lies at 95.9 (Nangia and Bowman, 2018). 


4 Our Approach 


Wsc Approach. We approach WSC by fine- 
tuning the pre-trained BERT LM (Devlin et al., 
2018) on the WSCR training set and further on 
a very large Winograd-like dataset that we intro- 
duce. Below, we present our fine-tuning objective 
function and the introduced dataset. 

Given a training sentence s, the pronoun to be 
resolved is masked out from the sentence, and the 
LM is used to predict the correct candidate in the 
place of the masked pronoun. Let cı and c3 be the 
two candidates. BERT for Masked Token Predic- 
tion is used to find P(c;|s) and P(cg|s). If a candi- 
date consists of several tokens, the corresponding 
number of [MASK] tokens is used in the masked 
sentence. Then, log P(c|s) is computed as the av- 
erage of log-probabilities of each composing to- 
ken. If cı is correct, and cs is not, the loss is: 


L = — log P(c\|s) + (d) 
+ a - max (0, log P(c2|s) — log P(c1|s) + 8), 


where a and 8 are hyperparameters. 


MaskedWiki Dataset. To get more data for 
fine-tuning, we automatically generate a large- 
scale collection of sentences similar to WSC. 
More specifically, our procedure searches a large 
text corpus for sentences that contain (at least) two 
occurrences of the same noun. We mask the sec- 
ond occurrence of this noun with the [MASK] to- 
ken. Several possible replacements for the masked 
token are given, for each noun in the sentence dif- 
ferent from the replaced noun. We thus obtain 
examples that are structurally similar to those in 
WSC, although we cannot ensure that they fulfill 
all the requirements (see Section 2). 

To generate such sentences, we choose the En- 
glish Wikipedia as source text corpus, as it is a 
large-scale and grammatically correct collection 
of text with diverse information. We use the Stan- 
ford POS tagger (Manning et al., 2014) for find- 
ing nouns. We obtain a dataset with approximately 
130M examples. We downsample the dataset uni- 
formly at random to obtain a dataset of manage- 
able size. After downsampling, the dataset con- 
sists of 2.4M examples. All experiments are con- 
ducted with this downsampled dataset only. 

To determine the quality of the dataset, 200 ran- 
dom examples are manually categorized into 4 cat- 
egories: 


e Unsolvable: the masked word cannot be un- 
ambiguously selected with the given context. 
Example: Palmer and Crenshaw both used 
Wilson 8802 putters , with [MASK] ’s receiv- 
ing the moniker “ Little Ben ” due to his pro- 
ficiency with it . [Palmer/Crenshaw] 


e Hard: the answer is not trivial to figure out. 
Example: At the time of Plath ’s suicide , As- 
sia was pregnant with Hughes ’s child , but 
she had an abortion soon after [MASK] ’s 
death . [Plath/Assia] 


Easy: The alternative sentence is grammati- 
cally incorrect or is very visibly an inferior 
choice. Example: The syllables are pro- 
nounced strongly by Gaga in syncopation 
while her vibrato complemented Bennett's 
characteristic jazz vocals and swing . Olivier 
added , “ [MASK] ’s voice , when stripped 
of its bells and whistles, showcases a time- 
lessness that lends itself well to the genre . ” 
[Gaga/syncopation] 


Noise: The example is a result of a parsing 
error. 


4839 


In the analyzed subset, 8.5% of examples were un- 
solvable, 45% were hard, 45.5% were easy, and 
1% fell into the noise category. 


WNLI Approach. Models are additionally 
tested on the test set of the WNLI dataset. To use 
the same evaluation approach as for the WSC273 
dataset, we transform the examples in WNLI 
from the premise—-hypothesis format into the 
masked words format. Since each hypothesis is 
just a substring of the premise with the pronoun 
replaced for the candidate, finding the replaced 
pronoun and one candidate can be done by finding 
the hypothesis as a substring of the premise. 
All other nouns in the sentence are treated as 
alternative candidates. The Stanford POS-tagger 
(Manning et al., 2014) is used to find the nouns in 
the sentence. The probability for each candidate 
is computed to determine whether the candidate 
in the hypothesis is the best match. Only the test 
set of the WNLI dataset is used, because it does 
not overlap with Wsc273. We do not train or 
validate on the WNLI training and validation sets, 
because some of the examples share the premise. 
Indeed, when upper rephrasing of the examples is 
used, the training, validation, and test sets start to 
overlap. 


5 Evaluation 


In this work, we use the PyTorch implementa- 
tion? of Devlin et al.’s (2018) pre-trained model, 
BERT-large. To obtain BERT_WIKI, we train on 
MaskedWiki starting from the pre-trained BERT. 
The training procedure differs from the training 
of BERT (Devlin et al., 2018) in a few points. 
The model is trained with a single epoch of the 
Masked Wiki dataset, using batches of size 64 (dis- 
tributed on 8 GPUs), Adam optimizer, a learn- 
ing rate of 5.0 - 1076, and hyperparameter val- 
ues a = 20 and 6 = 0.2 in the loss function 
(Eq. (1)). The values were selected from a € 
{5, 10,20} and 8 € {0.1,0.2,0.4} and learning 
rate from {3 - 1075, 1 < 1075,5 - 1076,3 - 1076} 
using grid search. To speed up the hyperparame- 
ter search, the training (for hyperparameter search 
only) is done on a randomly selected subset of size 
100,000. The performance is then compared on 
the WSCR test set. 

Both BERT and BERT_WIKI are fine-tuned on 
the WSCR training dataset to create BERT_WSCR 
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and BERT_WIKI_WSCR. 

The WSCR test set was used as the validation 
set. The fine-tuning procedure was the same as the 
training procedure on MaskedWiki, except that 30 
epochs were used. The model was validated af- 
ter every epoch, and the model with highest per- 
formance on the validation set was retained. The 
hyperparameters a and 8 and learning rate were 
selected with grid search from the same sets as for 
MaskedWiki training. 

For comparison, experiments are also con- 
ducted on two other LMs, BERT-base (BERT with 
less parameters) and General Pre-trained Trans- 
former (GPT) by Radford et al. (2018). The train- 
ing on BERT-base was conducted in the same way 
as for the other models. When using GPT, the 
probability of a word belonging to the sentence 
P(c|s) is computed as partial loss in the same way 
as by Trinh and Le (2018). 

Due to WSC’s “special word” property, exam- 
ples come in pairs. A pair of examples only differs 
in a single word (but the correct answers are dif- 
ferent). The model BERT_WIKI_WSCR_no_pairs 
is the BERT_WIKI model, fine-tuned on WSCR, 
where only a single example from each pair is 
retained. The size of WSCR is thus halved. 
The model BERT_WIKI_WSCR pairs is obtained 
by fine-tuning BERT_WIKI on half of the WSCR 
dataset. This time, all examples in the subset come 
in pairs, just like in the unreduced WSCR dataset. 

We evaluate all models on WSC273 and the 
WNLI test dataset, as well as the various subsets 
of WScC273, as described in Section 2. The re- 
sults are reported in Table 1 and will be discussed 
next. 


Discussion. Firstly, we note that models that are 
fine-tuned on the WSCR dataset consistently out- 
perform their non-fine-tuned counterparts. The 
BERT_WIKI_WSCR model outperforms other lan- 
guage models on 5 out of 6 sets that they are com- 
pared on. In comparison to the LM ensemble by 
Trinh and Le (2018), the accuracy is more consis- 
tent between associative and non-associative sub- 
sets and less affected by the switched parties. 
However, it remains fairly inconsistent, which is 
a general property of LMs. 

Secondly, the results of BERT_WIKI seem to in- 
dicate that this dataset alone does not help BERT. 
However, when additionally fine-tuned to WSCR, 
the accuracy consistently improves. 

Finally, the results of BERT_WIKI_no_pairs 
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Wsc273 non-assoc. assoc. unswitched switched consist. WNLI 
BERT_WIKI 0.619 0.597 0.757 0.573 0.603 0.389 0.712 
BERT_WIKI_LWSCR 0.725 0.720 0.757 0.732 0.710 0.550 0.747 
BERT 0.619 0.602 0.730 0.595 0.573 0.458 0.658 
BERT_WSCR 0.714 0.699 0.811 0.695 0.702 0.550 0.719 
BERT-base 0.564 0.551 0.649 0.527 0.565 0.443 0.630 
BERT-base_WSCR 0.623 0.606 0.730 0.611 0.634 0.443 0.705 
GPT 0.553 0.525 0.730 0.595 0.519 0.466 — 
GPT_WSCR 0.674 0.653 0.811 0.664 0.580 0.641 — 
BERT_WIKI_WSCR_no_pairs 0.663 0.669 0.622 0.672 0.641 0.511 — 
BERT_WIKI_WSCR_pairs 0.703 0.695 0.757 0.718 0.710 0.565 - 
LM ensemble 0.637 0.606 0.838 0.634 0.534 0.443 — 
Knowledge Hunter 0.571 0.583 0.5 0.588 0.588 0.901 - 


Table 1: Results on WSC273 and its subsets. The comparison between each language model and its WSCR-tuned 
model is given. For each column, the better result of the two is in bold. The best result in the column overall 
is underlined. Results for the LM ensemble and Knowledge Hunter are taken from Trichelair et al. (2018). All 
models consistently improve their accuracy when fine-tuned on the WSCR dataset. 


and BERT_WIKI_pairs show that the existence of 
WSc-like pairs in the training data affects the per- 
formance of the trained model. MaskedWiki does 
not contain such pairs. 


6 Summary and Outlook 


This work achieves new SOTA results on the 
Wsc273 and WNLI datasets by fine-tuning the 
BERT language model on the WSCR dataset and a 
newly introduced MaskedWiki dataset. The previ- 
ous SOTA results on WSC273 and WNLI are im- 
proved by 8.8% and 9.6%, respectively. To our 
knowledge, this is the first model that beats the 
majority baseline on WNLI. 

We show that by fine-tuning on WSC-like data, 
the language model’s performance on WSC con- 
sistently improves. The consistent improvement 
of several language models indicates the robust- 
ness of this method. This is particularly surprising, 
because previous work (Opitz and Frank, 2018) 
implies that generalizing to WSC273 is hard. 

In future work, other uses and the statistical 
significance of Masked Wiki’s impact and its ap- 
plications to different tasks will be investigated. 
Furthermore, to further improve the results on 
Wsc273, data-filtering procedures may be intro- 
duced to find harder Wsc-like examples. 
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